Abstract: We present a distributed proximal-gradient
method for optimizing the average of convex functions, each
of which is the private local objective of an agent in a
network with time-varying topology. The local objectives have 
distinct differentiable components, but they share a common 
nondifferentiable component, which has a favorable structure 
suitable for effective computation of the proximal operator. 
In our method, each agent iteratively updates its estimate of 
the global minimum by optimizing its local objective function, 
and exchanging estimates with others via communication in 
the network. Using Nesterov-type acceleration techniques and 
multiple communication steps per iteration, we show that this 
method converges at the rate 1/k (where k is the number of 
communication rounds between the agents), which is faster 
than the convergence rate of the existing distributed methods 
for solving this problem. The superior convergence rate of our 
method is also verified by numerical experiments. 

I. Introduction 

There has been a growing interest in developing dis- 
tributed methods that enable the collection, storage, and 
processing of data using multiple agents connected through 
a network. Many of these problems can be formulated as 

$$\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x), \tag{1}$$

where $m$ is the number of agents in the network, $f(x)$ is
the global objective function, and for each $i = 1, \ldots, m$,
$f_i(x)$ is a local objective function determined by private
information available to agent $i$. The goal is for agents to
cooperatively solve problem (1). Most methods for solving
this problem involve each agent maintaining an estimate
of the global optimum of problem (1) and updating this
estimate iteratively using his own private information and
information exchanged with neighbors over the network.
Examples include a team of sensors exploring an unknown
terrain, where $f_i(x)$ may represent a regularized least-squares
fit to the measurement taken at agent $i$. As another example,
in a distributed machine learning problem, $f_i(x)$ may
represent a regularized loss function according to training
samples accessible to agent $i$.

Most optimization algorithms developed for solving problem (1)
and its variations are first-order methods (i.e.,
methods that use gradient or subgradient information of
the objective functions), which are computationally inexpensive
and naturally lead to distributed implementations
over networks. These methods typically converge at rate
$1/\sqrt{n}$, where $n$ is the number of communication steps in
which agents exchange their estimates; in other words, the
difference between the global objective function value at
an agent estimate and the optimal value of problem (1) is
inversely proportional to the square root of the number of
communication steps carried out (see [1] for a distributed
subgradient method and [2] for a distributed dual averaging
algorithm with this rate). An exception is the recent independent
work [3], which developed a distributed gradient
method with a diminishing step size rule, and showed that
under certain conditions on the communication network
and higher-order differentiability assumptions, the method
converges at rate $\log(n)/n$.

In this paper, we focus on a structured version of problem (1)
where the local objective function $f_i(x)$ takes the
additive form $f_i(x) = g_i(x) + h(x)$ with $g_i$ a differentiable
function and $h$ a common nondifferentiable function.¹ We
develop a distributed proximal gradient method that solves
this problem at rate $1/n$ over a network with time-varying
connectivity. Our method involves each agent maintaining
an estimate of the optimal solution to problem (1) and
updating it through the following steps: at iteration $k$, each
agent $i$ takes a step along the negative gradient direction
of $g_i$, the differentiable component of his local objective
function, and then enters the consensus stage to exchange
his estimate with his neighbors. The consensus stage consists
of $k$ communication steps. In each communication step, the
agent updates his estimate to a linear combination of his
current estimate and the estimates received from neighbors.
After the consensus stage, each agent performs a proximal
step with respect to $h$, the nondifferentiable part of his
objective function, at his current estimate, followed by a
Nesterov-type acceleration step.

This algorithm has two novel features: first, the multi-step
consensus stage brings the estimates of the agents close
together before performing the proximal step, hence allowing
us to reformulate this method as an inexact centralized
proximal gradient method with controlled error. Our analysis
then uses the recent results on the convergence rate of an
inexact (centralized) proximal-gradient method (see [4]) to
establish the convergence rate of the distributed method.
Second, exploiting the special structure in the objective
functions allows for the use of a proximal-gradient method
that can be accelerated using a Nesterov acceleration step,
leading to the faster convergence rate of the algorithm.

¹ This models problems in which $h$ represents a nondifferentiable
regularization term. For example, a common choice for machine learning
applications is $h(x) = \lambda \|x\|_1$, where $\|x\|_1$ is the sum of the
absolute values of each element of the vector $x$.

Other than the papers cited above, our paper is related to
the seminal works [5] and [6], which developed distributed
methods for solving global optimization problems with a
common objective (i.e., $f_i(x) = f(x)$ for all $i$ in problem
(1)) using parallel computations in multiple servers. It is
also related to the literature on consensus problems and
algorithms (see [7]-[10]) and a recent growing literature on
multi-agent optimization where information is decentralized
among multiple agents connected through a network (see
[11]-[15] for subgradient algorithms and [16], [17] for
algorithms based on the alternating direction method of
multipliers). Finally, our paper builds on the seminal papers on
(centralized) proximal-point and proximal-gradient methods
(see [18]-[20]).

The paper is organized as follows: Section 2 describes 
preliminary results pertinent to our work. In Section 3, we 
introduce our fast distributed proximal-gradient method and 
establish its convergence rate. Section 4 presents numerical 
experiments that verify the effectiveness of our method. 
Finally, Section 5 concludes the paper with open questions 
for future work. 
Notations and Definitions: 

• For a vector or scalar that is local, we use subscript(s)
to denote the agent(s) it belongs to, and superscripts
with parentheses to denote the iteration number; for
example, $x_i^{(k)}$ denotes the estimate of agent $i$ at
iteration $k$.

• For a vector or scalar that is common to every agent,
or is part of a centralized formulation, the iteration
number is also written in superscripts with parentheses;
for example, $\bar{x}^{(k)}$ denotes the average estimate of all
agents at iteration $k$. Similarly, $e^{(k)}$ and $\varepsilon^{(k)}$ denote the
errors in the centralized formulation at iteration $k$.

• The standard inner product of two vectors $x, y \in \mathbb{R}^d$
is denoted $\langle x, y \rangle = x'y$. For $x \in \mathbb{R}^d$, its Euclidean
norm is $\|x\| = \sqrt{\langle x, x \rangle}$, and its 1-norm is
$\|x\|_1 = \sum_{l=1}^{d} |x(l)|$, where $x(l)$ is its $l$-th entry.

• For a matrix $A$, we denote its entry at the $i$-th row and
$j$-th column as $[A]_{ij}$. We also write $[a_{ij}]$ to represent
a matrix $A$ with $[A]_{ij} = a_{ij}$. A matrix $A$ is said to be
stochastic if the entries in each row sum up to 1, and
it is doubly stochastic if $A$ and $A'$ are both stochastic.

• We write $a(n) = O(b(n))$ if and only if there exist
a positive real number $M$ and a real number $n_0$ such
that $|a(n)| \le M |b(n)|$ for all $n \ge n_0$.

• For a function $F : \mathbb{R}^d \to (-\infty, \infty]$, we denote the
domain of $F$ by $\mathrm{dom}(F)$, where

$$\mathrm{dom}(F) = \{x \in \mathbb{R}^d \mid F(x) < \infty\}.$$

For a given vector $x \in \mathrm{dom}(F)$, we say that $z_F(x) \in \mathbb{R}^d$
is a subgradient of the function $F$ at $x$ when the
following relation holds:

$$F(x) + \langle z_F(x), y - x \rangle \le F(y) \quad \text{for all } y \in \mathrm{dom}(F).$$

The set of all subgradients of $F$ at $x$ is denoted by $\partial F(x)$.

II. Preliminaries 

In this section, we introduce the main concepts and 
establish key results on which our subsequent analysis 
relies. Section 2.1 gives properties of the proximal operator; 
Section 2.2 summarizes convergence rate results for an 
inexact centralized proximal-gradient method characterized 
in terms of the errors introduced in the method. 

A. Properties of the Proximal Operator 

For a closed proper convex function $h : \mathbb{R}^d \to (-\infty, \infty]$
and a scalar $\alpha > 0$, we define the proximal operator with
respect to $h$ as

$$\mathrm{prox}_h^{\alpha}\{x\} = \arg\min_{z \in \mathbb{R}^d} \left\{ h(z) + \frac{1}{2\alpha} \|z - x\|^2 \right\}.$$

It follows that the minimization in the preceding optimization
problem is attained at a unique point $y = \mathrm{prox}_h^{\alpha}\{x\}$, i.e.,
the proximal operator is a single-valued map [18]. Moreover,
using the optimality condition for this problem

$$0 \in \partial h(y) + \frac{1}{\alpha}(y - x),$$

we can see that the proximal operator has the following
properties [19]:

Proposition 1: (Basic properties of the proximal operator)
Let $h : \mathbb{R}^d \to (-\infty, \infty]$ be a closed proper convex
function. For a scalar $\alpha > 0$ and $x \in \mathbb{R}^d$, let $y = \mathrm{prox}_h^{\alpha}\{x\}$.

(a) We have $\frac{1}{\alpha}(x - y) \in \partial h(y)$.

(b) The vector $y$ can be written as $y = x - \alpha z$, where
$z \in \partial h(y)$.

(c) We have $h(u) \ge h(y) + \frac{1}{\alpha} \langle x - y, u - y \rangle$ for all $u \in \mathbb{R}^d$.

(d) (Nonexpansiveness) For $x, \hat{x} \in \mathbb{R}^d$, we have

$$\|\mathrm{prox}_h^{\alpha}\{x\} - \mathrm{prox}_h^{\alpha}\{\hat{x}\}\| \le \|x - \hat{x}\|.$$
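
As an illustration of Proposition 1, the following minimal sketch evaluates the proximal operator in closed form for the special case $h(z) = \lambda \|z\|_1$ (soft-thresholding) and checks nonexpansiveness on two sample points. The function names and numerical values are illustrative assumptions, not part of the paper.

```python
import math

def prox_l1(x, alpha, lam):
    """Prox of h(z) = lam * ||z||_1 with parameter alpha (soft-thresholding).

    Solves argmin_z lam*||z||_1 + (1/(2*alpha))*||z - x||^2 entry by entry.
    """
    t = alpha * lam  # shrinkage threshold
    return [math.copysign(max(abs(v) - t, 0.0), v) for v in x]

def dist(u, v):
    """Euclidean distance between two vectors given as lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Illustrative (assumed) inputs; the threshold alpha*lam equals 1.0 here.
alpha, lam = 0.5, 2.0
x = [3.0, -0.5, 1.2]
x_hat = [2.0, 0.4, -1.0]
y = prox_l1(x, alpha, lam)
y_hat = prox_l1(x_hat, alpha, lam)

# Proposition 1(d): the prox map never increases distances.
nonexpansive = dist(y, y_hat) <= dist(x, x_hat)
```

Entries whose magnitude is below the threshold are set to zero, which is the "favorable structure" that makes this particular proximal step cheap.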

B. Inexact Proximal-Gradient Method 

Our approach for the analysis of the proposed distributed 
method is to view it as an inexact centralized proximal- 
gradient method, with the error controlled by multiple 
communication steps at each iteration. This allows us to 
use recent results on the convergence rate of an inexact 
centralized proximal-gradient method to establish the convergence
rate of our distributed method. The following
proposition from [4] characterizes the convergence rate of an
inexact proximal-point method in terms of error sequences
$\{e^{(k)}\}_{k=1}^{\infty}$ and $\{\varepsilon^{(k)}\}_{k=1}^{\infty}$.

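Proposition 2: [4, Proposition 2] Let $g : \mathbb{R}^d \to \mathbb{R}$ be
a convex function that has a Lipschitz continuous gradient
with Lipschitz constant $L$, and let $h : \mathbb{R}^d \to (-\infty, \infty]$ be a
lower semi-continuous proper convex function. Suppose the
function $f = g + h$ attains its minimum at a certain $x^* \in \mathbb{R}^d$.
Given two sequences $\{e^{(k)}\}_{k=1}^{\infty}$ and $\{\varepsilon^{(k)}\}_{k=1}^{\infty}$, where
$e^{(k)} \in \mathbb{R}^d$ and $\varepsilon^{(k)} \in \mathbb{R}$ for every $k$, consider the accelerated
inexact proximal gradient method, which iterates the
following recursion:

$$\begin{aligned}
x^{(k)} &\in \mathrm{prox}_{\alpha, \varepsilon^{(k)}} \left\{ y^{(k-1)} - \alpha \left( \nabla g(y^{(k-1)}) + e^{(k)} \right) \right\}, \\
y^{(k)} &= x^{(k)} + \frac{k-1}{k+2} \left( x^{(k)} - x^{(k-1)} \right),
\end{aligned} \tag{2}$$

where the step size is $\alpha = \frac{1}{L}$, and

$$\mathrm{prox}_{\alpha, \varepsilon}\{y\} = \left\{ x \in \mathbb{R}^d \;\middle|\; h(x) + \frac{1}{2\alpha} \|x - y\|^2 \le \min_{z \in \mathbb{R}^d} \left( h(z) + \frac{1}{2\alpha} \|z - y\|^2 \right) + \varepsilon \right\} \tag{3}$$

indicates the set of all $\varepsilon$-optimal solutions for the proximal
operator.

Then, for all $n \ge 1$, we have

$$f(x^{(n)}) - f(x^*) \le \frac{2L}{(n+1)^2} \left( \|x^{(0)} - x^*\| + 2 \tilde{A}_n + \sqrt{2 \tilde{B}_n} \right)^2,$$

where

$$\tilde{A}_n = \sum_{k=1}^{n} k \left( \frac{\|e^{(k)}\|}{L} + \sqrt{\frac{2 \varepsilon^{(k)}}{L}} \right), \qquad \tilde{B}_n = \sum_{k=1}^{n} \frac{k^2 \varepsilon^{(k)}}{L}.$$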

Proposition 2 indicates that as long as the error sequences
$\{\|e^{(k)}\|\}_{k=1}^{\infty}$ and $\{\varepsilon^{(k)}\}_{k=1}^{\infty}$ are such that the sequences
$\{k \|e^{(k)}\|\}_{k=1}^{\infty}$ and $\{k \sqrt{\varepsilon^{(k)}}\}_{k=1}^{\infty}$ are both summable, then
the accelerated inexact gradient method achieves the optimal
convergence rate of $O(1/n^2)$. It is straightforward to verify
using the analysis in [4] that the result also holds for a
constant step size $\alpha \le \frac{1}{L}$.
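
To make the recursion of Proposition 2 concrete, here is a minimal centralized sketch on a toy problem. Everything specific in it is an assumption chosen for illustration: the smooth part is $g(x) = \frac{1}{2} \sum_l d_l (x_l - c_l)^2$, the nonsmooth part is $h(x) = \lambda \|x\|_1$, the gradient error $e^{(k)}$ decays geometrically so that $\{k\|e^{(k)}\|\}$ is summable, and the proximal step is exact ($\varepsilon^{(k)} = 0$).

```python
import math

def soft_threshold(x, t):
    """Prox of t*||.||_1: shrink every entry toward zero by t."""
    return [math.copysign(max(abs(v) - t, 0.0), v) for v in x]

def accelerated_inexact_prox_grad(c, d, lam, n_iters, err_decay=0.5):
    """Accelerated inexact proximal-gradient recursion on an assumed toy problem.

    g(x) = 0.5 * sum_l d[l] * (x[l] - c[l])**2 and h(x) = lam * ||x||_1.
    """
    L = max(d)                 # Lipschitz constant of the gradient of g
    alpha = 1.0 / L            # step size alpha = 1/L
    x_prev = [0.0] * len(c)
    y = list(x_prev)
    for k in range(1, n_iters + 1):
        # Assumed gradient error e^(k) with geometric decay (summable k*||e^(k)||).
        e = [err_decay ** k] * len(c)
        grad = [dl * (yl - cl) for dl, yl, cl in zip(d, y, c)]
        step = [yl - alpha * (gl + el) for yl, gl, el in zip(y, grad, e)]
        x = soft_threshold(step, alpha * lam)      # exact prox, so eps^(k) = 0
        beta = (k - 1) / (k + 2)                   # Nesterov-type extrapolation weight
        y = [xl + beta * (xl - xpl) for xl, xpl in zip(x, x_prev)]
        x_prev = x
    return x_prev

# For this toy problem the minimizer is known in closed form: [2.0, 0.0].
x_final = accelerated_inexact_prox_grad(c=[3.0, -0.2], d=[1.0, 4.0], lam=1.0, n_iters=500)
```

Despite the perturbed gradients, the iterates approach the exact minimizer, which is the behavior the summability condition is meant to guarantee.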

We shall see that error sequences in our inexact formulation,
introduced by the distributed nature of our problem and
controlled by multi-step consensus, can be bounded by
sequences of the form $\{p^{(k)} \gamma^k\}_{k=1}^{\infty}$ for some polynomial $p^{(k)}$
of $k$ and some $\gamma \in (0, 1)$, which we shall henceforth refer
to as polynomial-geometric sequences. The next proposition
shows that such sequences are summable and allows us to
use Proposition 2 in the convergence analysis of our method
(the proof is omitted due to limited space).

Proposition 3: (Summability of polynomial-geometric
sequences)

Let $\gamma$ be a positive scalar such that $\gamma < 1$, and let

$$P(k, N) = \{ c_N k^N + \ldots + c_1 k + c_0 \mid c_j \in \mathbb{R},\; j = 0, \ldots, N \}$$

denote the set of all $N$-th order polynomials of $k$, where $N$
is a nonnegative integer. Then for every polynomial $p^{(k)} \in P(k, N)$,

$$\sum_{k=0}^{\infty} p^{(k)} \gamma^k < \infty.$$

The result of this proposition for $p^{(k)} = k^N$ will be
particularly useful for our analysis in the upcoming sections.
Therefore, we make the following definition:

$$S_N^{\gamma} \triangleq \sum_{k=0}^{\infty} k^N \gamma^k < \infty. \tag{4}$$
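
The summability of polynomial-geometric sequences can be checked numerically. The sketch below computes partial sums of $k^N \gamma^k$ and compares them with the standard closed forms $\sum_k k \gamma^k = \gamma/(1-\gamma)^2$ and $\sum_k k^2 \gamma^k = \gamma(1+\gamma)/(1-\gamma)^3$; the choice $\gamma = 0.5$ and the truncation point are arbitrary illustrative assumptions.

```python
def partial_sum(N, gamma, K):
    """Partial sum of k^N * gamma^k for k = 0, ..., K."""
    return sum(k ** N * gamma ** k for k in range(K + 1))

gamma = 0.5                      # any value in (0, 1) works; 0.5 is an arbitrary choice
s1 = partial_sum(1, gamma, 200)  # approaches gamma / (1 - gamma)^2 = 2
s2 = partial_sum(2, gamma, 200)  # approaches gamma * (1 + gamma) / (1 - gamma)^3 = 6
s3_short = partial_sum(3, gamma, 100)
s3_long = partial_sum(3, gamma, 200)   # barely changes: the tail is negligible
```

The geometric factor eventually dominates any fixed polynomial power, so the partial sums level off quickly.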



III. Model and Method 
We consider the optimization problem

$$\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x), \tag{5}$$

where $f(x)$ is the global objective function, and $f_i(x) = g_i(x) + h(x)$,
$i = 1, \ldots, m$, are local objective functions
that are private to each agent. For example, for regularized
logistic regression, the local objective functions are
given by $g_i(x) = \frac{1}{|N_i|} \sum_{j \in N_i} \log(1 + \exp(-b_j \langle a_j, x \rangle))$
and $h(x) = \lambda \|x\|_1$, where $N_i$ is the training dataset of agent
$i$, corresponding to $\{a_j \mid j \in N_i\}$, the set of feature vectors,
and $\{b_j \mid j \in N_i\}$, the set of associated labels.

We adopt the following assumption on the functions $g_i(x)$
and $h(x)$.

Assumption 1: (a) For every $i$, $g_i : \mathbb{R}^d \to \mathbb{R}$ is convex,
continuously differentiable, and has a Lipschitz-continuous
gradient with Lipschitz constant $L > 0$, i.e.,

$$\|\nabla g_i(x) - \nabla g_i(y)\| \le L \|x - y\| \quad \text{for all } x, y \in \mathbb{R}^d.$$

(b) There exists a scalar $G_g$ such that for every $i$ and for
every $x \in \mathbb{R}^d$, $\|\nabla g_i(x)\| \le G_g$.

(c) $h : \mathbb{R}^d \to \mathbb{R}$ is convex.

(d) There exists a scalar $G_h$ such that for every $x \in \mathbb{R}^d$,
$\|z\| \le G_h$ for each subgradient $z \in \partial h(x)$.

(e) $f(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x) = \frac{1}{m} \sum_{i=1}^{m} g_i(x) + h(x)$ attains
its minimum at a certain $x^*$.

These assumptions are standard in the analysis of dis- 
tributed first-order methods (see |j6), pT| and |[l)). 

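We propose the following distributed proximal-gradient
method for solving problem (5): Starting from initial estimates
$\{y_i^{(0)}\}_{i=1,\ldots,m}$ with $y_i^{(0)} \in \mathbb{R}^d$, each agent $i$ updates
his estimate $y_i^{(k-1)}$ at iteration $k$ as follows:

$$q_i^{(k)} = y_i^{(k-1)} - \alpha \nabla g_i(y_i^{(k-1)}) \tag{6a}$$

$$\hat{q}_i^{(k)} = \sum_{j=1}^{m} \lambda_{ij}^{(k)} q_j^{(k)} \tag{6b}$$

$$x_i^{(k)} = \mathrm{prox}_h^{\alpha}\{\hat{q}_i^{(k)}\} \tag{6c}$$

$$y_i^{(k)} = x_i^{(k)} + \frac{k-1}{k+2} \left( x_i^{(k)} - x_i^{(k-1)} \right) \tag{6d}$$

Here, $\alpha > 0$ is a constant stepsize which is also the
constant scalar used in the proximal operator. The scalars
$\lambda_{ij}^{(k)}$ are weights given by

$$\lambda_{ij}^{(k)} = \left[ \Phi(t^{(k)} + k,\; t^{(k)}) \right]_{ij}$$

for all $i, j = 1, \ldots, m$ and all $k \ge 1$, where $t^{(k)}$ is the total
number of communication steps before iteration $k$, and $\Phi$
is a transition matrix representing the product of matrices
$A(t)$, i.e., for $t \ge s$,

$$\Phi(t, s) = A(t) A(t-1) \cdots A(s+1) A(s),$$

where $A(t) = [a_{ij}(t)]_{i,j=1,\ldots,m}$ is a matrix of weights
$a_{ij}(t) \ge 0$ for $i, j = 1, \ldots, m$. Using a vector notation
$q^{(k)} = [q_i^{(k)}]_{i=1,\ldots,m}$ and $\hat{q}^{(k)} = [\hat{q}_i^{(k)}]_{i=1,\ldots,m}$, we can
rewrite (6b) as

$$\hat{q}^{(k)} = \Phi(t^{(k)} + k,\; t^{(k)}) \, q^{(k)}.$$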



Hence, this step represents agents performing $k$ communication
steps at iteration $k$. At each communication step,
agents exchange their values $q_i^{(k)}$ and update these values by
linearly combining the received values using weights $A(t)$.
We refer to (6b) as a multi-step consensus stage, since linear
(in fact, convex, as we shall see) combinations of estimates
will serve to bring the agent estimates close to each other.
Our method involves each agent updating his estimate
along the negative gradient of the differentiable part of his
local objective function (step (6a)), a multi-step consensus
stage (step (6b)), and a proximal step with respect to the
nondifferentiable part of his local objective function (step
(6c)), which is then followed by a Nesterov-type acceleration
step (step (6d)). Hence, it is a distributed proximal-gradient
method with a multi-step consensus stage inserted before
the proximal step. The multi-step consensus stage serves to
bring the estimates close to each other before performing the
proximal step with respect to the nondifferentiable function
$h$. This enables us to control the error in the reformulation
of the method as an inexact centralized proximal-gradient
method.
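
The four steps described above (gradient step, multi-step consensus, proximal step, Nesterov-type extrapolation) can be sketched as follows. This is a toy illustration under stated assumptions rather than the implementation used in the paper: each agent holds $g_i(x) = \frac{1}{2}\|x - c_i\|^2$, the common term is $h(x) = \lambda \|x\|_1$, and a single fixed doubly stochastic ring matrix stands in for the time-varying weights $A(t)$. All names are hypothetical.

```python
import math

def soft_threshold(x, t):
    """Prox of t*||.||_1 applied entrywise."""
    return [math.copysign(max(abs(v) - t, 0.0), v) for v in x]

def mix(A, Q):
    """One communication step: agent i forms sum_j A[i][j] * Q[j]."""
    m, d = len(Q), len(Q[0])
    return [[sum(A[i][j] * Q[j][l] for j in range(m)) for l in range(d)]
            for i in range(m)]

def distributed_prox_grad(C, A, lam, alpha, n_iters):
    """Toy run of the distributed method with g_i(x) = 0.5*||x - C[i]||^2 (assumed)."""
    m, d = len(C), len(C[0])
    X_prev = [[0.0] * d for _ in range(m)]
    Y = [[0.0] * d for _ in range(m)]
    for k in range(1, n_iters + 1):
        # Gradient step on the local smooth part; grad g_i(y) = y - C[i].
        Q = [[y - alpha * (y - c) for y, c in zip(Y[i], C[i])] for i in range(m)]
        # Multi-step consensus: k communication steps at iteration k.
        for _ in range(k):
            Q = mix(A, Q)
        # Proximal step with respect to h = lam * ||.||_1.
        X = [soft_threshold(q, alpha * lam) for q in Q]
        # Nesterov-type extrapolation with weight (k - 1) / (k + 2).
        beta = (k - 1) / (k + 2)
        Y = [[x + beta * (x - xp) for x, xp in zip(X[i], X_prev[i])]
             for i in range(m)]
        X_prev = X
    return X_prev

# Assumed ring network of 4 agents with a fixed doubly stochastic weight matrix.
A = [[0.50, 0.25, 0.00, 0.25],
     [0.25, 0.50, 0.25, 0.00],
     [0.00, 0.25, 0.50, 0.25],
     [0.25, 0.00, 0.25, 0.50]]
C = [[4.0, 0.1], [2.0, -0.3], [0.0, 0.2], [2.0, 0.4]]   # private data, mean [2.0, 0.1]
X = distributed_prox_grad(C, A, lam=0.5, alpha=0.5, n_iters=60)
# The global minimizer for this toy problem is soft_threshold([2.0, 0.1], 0.5) = [1.5, 0.0].
```

Because the consensus stage at iteration $k$ runs $k$ mixing steps, the disagreement among agents before the proximal step shrinks geometrically in $k$, and all agents end up near the same minimizer.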

We analyze the convergence behavior of this method
under the information exchange model developed in [1], [5],
which we summarize in this section. Let $A(t) = [a_{ij}(t)]$
be the weight matrix used in communication step $t$ of the
consensus stage. While the weight matrix $A(t)$ may be
time-varying, we assume that it satisfies the following conditions
for all $t$.

Assumption 2: (Weight Matrix and Network Conditions)
Consider the weight matrices $A(t) = [a_{ij}(t)]$, $t = 1, 2, \ldots$.

(a) (Double stochasticity) For every $t$, $A(t)$ is doubly
stochastic.

(b) (Significant weights) There exists a scalar $\eta \in (0, 1)$
such that for all $i$, $a_{ii}(t) \ge \eta$, and for $j \ne i$, either
$a_{ij}(t) \ge \eta$, in which case $j$ is said to be a neighbor of
$i$, and receives the estimate of $i$, at time $t$; or $a_{ij}(t) = 0$,
in which case $j$ is not a neighbor of $i$ at time $t$.

(c) (Connectivity and bounded intercommunication intervals)
Let

$$E_t = \{(j, i) \mid j \text{ receives the estimate of } i \text{ at time } t\},$$
$$E_\infty = \{(j, i) \mid j \text{ receives the estimate of } i \text{ for infinitely many } t\}.$$

Then $E_\infty$ is connected. Moreover, there exists an integer
$B \ge 1$ such that if $(j, i) \in E_\infty$, then
$(j, i) \in E_t \cup E_{t+1} \cup \ldots \cup E_{t+B-1}$.

In this assumption, part (a) ensures that each agent's
estimate exerts an equal influence on the estimates of others
in the network. Part (b) guarantees that in updating his
estimate, each agent gives significant weight to his current
estimate and the estimates received from his neighbors. Part
(c) states that the overall communication network is capable
of exchanging information between any pair of agents in
bounded time. An important implication of this assumption
is that for $\bar{B} = (m-1)B$, we have
$E_t \cup E_{t+1} \cup \ldots \cup E_{t+\bar{B}-1} = E_\infty$.

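The following result from [1] on the limiting behavior of
products of weight matrices will be key in establishing the
convergence of our algorithm in the subsequent analysis.

Proposition 4: [1, Proposition 1(b)] Let Assumption 2
hold, and for $t \ge s$, let

$$\Phi(t, s) = A(t) A(t-1) \cdots A(s+1) A(s).$$

Then the entries $[\Phi(t, s)]_{ij}$ converge to $\frac{1}{m}$ as $t \to \infty$ with
a geometric rate uniformly with respect to $i$ and $j$. Specifically,
for all $i, j \in \{1, \ldots, m\}$ and all $t, s$ with $t \ge s$,

$$\left| [\Phi(t, s)]_{ij} - \frac{1}{m} \right| \le 2 \, \frac{1 + \eta^{-\bar{B}}}{1 - \eta^{\bar{B}}} \left( 1 - \eta^{\bar{B}} \right)^{\frac{t-s}{\bar{B}}}.$$

For simplicity, we shall denote $\Gamma = 2 \, \frac{1 + \eta^{-\bar{B}}}{1 - \eta^{\bar{B}}}$ and
$\gamma = (1 - \eta^{\bar{B}})^{\frac{1}{\bar{B}}}$, and restate this proposition as

$$\left| [\Phi(t, s)]_{ij} - \frac{1}{m} \right| \le \Gamma \gamma^{t-s}. \tag{7}$$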
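
The geometric convergence of the transition matrices to uniform averaging can be observed on a small example. In the sketch below, which is an illustrative assumption and not the network used in the experiments, four agents alternate between two pairings; neither graph is connected on its own, but their union is, which mirrors the bounded intercommunication condition with $B = 2$.

```python
def matmul(A, B):
    """Product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def max_deviation(P):
    """Largest entrywise distance between P and the uniform matrix with entries 1/m."""
    m = len(P)
    return max(abs(P[i][j] - 1.0 / m) for i in range(m) for j in range(m))

# Two assumed doubly stochastic weight matrices on 4 agents.
A_even = [[0.6, 0.4, 0.0, 0.0],   # agents pair up as (0,1) and (2,3)
          [0.4, 0.6, 0.0, 0.0],
          [0.0, 0.0, 0.6, 0.4],
          [0.0, 0.0, 0.4, 0.6]]
A_odd = [[0.6, 0.0, 0.0, 0.4],    # agents pair up as (0,3) and (1,2)
         [0.0, 0.6, 0.4, 0.0],
         [0.0, 0.4, 0.6, 0.0],
         [0.4, 0.0, 0.0, 0.6]]

Phi = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
devs = []
for t in range(20):
    A_t = A_even if t % 2 == 0 else A_odd
    Phi = matmul(A_t, Phi)          # Phi(t, 0) = A(t) * Phi(t-1, 0)
    devs.append(max_deviation(Phi))
```

The recorded deviations shrink by a roughly constant factor every two steps, consistent with a bound of the form $\Gamma \gamma^{t-s}$.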



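This proposition will ensure that the distance between each
agent's estimate and the average estimate decreases
geometrically with respect to the number of communication
steps taken in the consensus stage. In particular, it gives the
following bound on the distance between iterates $\hat{q}_i^{(k)}$, the
outcomes of the multi-step consensus stage (6b),
and their average, $\bar{q}^{(k)} = \frac{1}{m} \sum_{i=1}^{m} q_i^{(k)}$:

$$\left\| \hat{q}_i^{(k)} - \bar{q}^{(k)} \right\| = \left\| \sum_{j=1}^{m} \left( \lambda_{ij}^{(k)} - \frac{1}{m} \right) q_j^{(k)} \right\| \tag{8}$$

$$\le \Gamma \gamma^k \sum_{j=1}^{m} \left\| q_j^{(k)} \right\|. \tag{9}$$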



A. Formulation as an Inexact Method 



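We now show that our method can be formulated as
an inexact centralized proximal gradient method in the
framework of [4]:

Proposition 5: (Distributed proximal-gradient method as
an inexact centralized proximal-gradient method) Let $x_i^{(k)}$
and $y_i^{(k)}$ be iterates generated by Algorithm (6). Let
$\bar{x}^{(k)} = \frac{1}{m} \sum_{i=1}^{m} x_i^{(k)}$ and $\bar{y}^{(k)} = \frac{1}{m} \sum_{i=1}^{m} y_i^{(k)}$ be the average
iterates at iteration $k$. Then Algorithm (6) can be written as

$$\begin{aligned}
\bar{x}^{(k)} &\in \mathrm{prox}_{\alpha, \varepsilon^{(k)}} \left\{ \bar{y}^{(k-1)} - \alpha \left( \nabla g(\bar{y}^{(k-1)}) + e^{(k)} \right) \right\}, \\
\bar{y}^{(k)} &= \bar{x}^{(k)} + \frac{k-1}{k+2} \left( \bar{x}^{(k)} - \bar{x}^{(k-1)} \right),
\end{aligned} \tag{10}$$

where $g = \frac{1}{m} \sum_{i=1}^{m} g_i$ and the error sequences
$\{e^{(k)}\}_{k=1}^{\infty}$ and $\{\varepsilon^{(k)}\}_{k=1}^{\infty}$ satisfy

$$\|e^{(k)}\| \le \frac{L}{m} \sum_{i=1}^{m} \left\| y_i^{(k-1)} - \bar{y}^{(k-1)} \right\|, \tag{11}$$

$$\varepsilon^{(k)} \le \frac{2 G_h}{m} \sum_{i=1}^{m} \left\| \hat{q}_i^{(k)} - \bar{q}^{(k)} \right\| + \frac{1}{2\alpha} \left( \frac{1}{m} \sum_{i=1}^{m} \left\| \hat{q}_i^{(k)} - \bar{q}^{(k)} \right\| \right)^2. \tag{12}$$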



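Proof: By taking the average of (6a), we can see that

$$\bar{q}^{(k)} = \bar{y}^{(k-1)} - \alpha \left( \nabla g(\bar{y}^{(k-1)}) + e^{(k)} \right),$$

where

$$e^{(k)} = \frac{1}{m} \sum_{i=1}^{m} \left[ \nabla g_i(y_i^{(k-1)}) - \nabla g_i(\bar{y}^{(k-1)}) \right],$$

and therefore, due to the Lipschitz-continuity of the gradient
of $g_i(x)$,

$$\|e^{(k)}\| \le \frac{1}{m} \sum_{i=1}^{m} \left\| \nabla g_i(y_i^{(k-1)}) - \nabla g_i(\bar{y}^{(k-1)}) \right\| \le \frac{L}{m} \sum_{i=1}^{m} \left\| y_i^{(k-1)} - \bar{y}^{(k-1)} \right\|.$$

Let

$$z^{(k)} = \mathrm{prox}_h^{\alpha}\{\bar{q}^{(k)}\} = \arg\min_{x} \left\{ h(x) + \frac{1}{2\alpha} \left\| x - \bar{q}^{(k)} \right\|^2 \right\}$$

denote the result of the exact centralized proximal step. Then
$\bar{x}^{(k)} = \frac{1}{m} \sum_{i=1}^{m} x_i^{(k)} = \frac{1}{m} \sum_{i=1}^{m} \mathrm{prox}_h^{\alpha}\{\hat{q}_i^{(k)}\}$, the result of
the proximal step in the distributed method, can be seen as
an approximation of $z^{(k)}$. We next relate $z^{(k)}$ and $\bar{x}^{(k)}$ by
formulating the latter as an inexact proximal step with error
$\varepsilon^{(k)}$. A simple algebraic expansion gives

$$\begin{aligned}
h(\bar{x}^{(k)}) + \frac{1}{2\alpha} \left\| \bar{x}^{(k)} - \bar{q}^{(k)} \right\|^2
&\le h(z^{(k)}) + G_h \left\| \bar{x}^{(k)} - z^{(k)} \right\| + \frac{1}{2\alpha} \left\| \bar{x}^{(k)} - \bar{q}^{(k)} \right\|^2 \\
&= \min_{z} \left\{ h(z) + \frac{1}{2\alpha} \left\| z - \bar{q}^{(k)} \right\|^2 \right\} + G_h \left\| \bar{x}^{(k)} - z^{(k)} \right\| \\
&\quad + \frac{1}{2\alpha} \left\| \bar{x}^{(k)} - z^{(k)} \right\|^2 + \frac{1}{\alpha} \left\langle \bar{x}^{(k)} - z^{(k)},\; z^{(k)} - \bar{q}^{(k)} \right\rangle,
\end{aligned}$$

where in the inequality, we used the convexity of $h(x)$ and
the bound on the subgradient $\partial h(\bar{x}^{(k)})$ to obtain $h(\bar{x}^{(k)}) \le h(z^{(k)}) + G_h \|\bar{x}^{(k)} - z^{(k)}\|$;
and in the equality, we used
the fact that by definition, $z^{(k)}$ is the optimizer of
$h(x) + \frac{1}{2\alpha} \|x - \bar{q}^{(k)}\|^2$.

With this expression, we can write

$$\bar{x}^{(k)} \in \mathrm{prox}_{\alpha, \varepsilon^{(k)}}\{\bar{q}^{(k)}\},$$

where

$$\varepsilon^{(k)} = G_h \left\| \bar{x}^{(k)} - z^{(k)} \right\| + \frac{1}{2\alpha} \left\| \bar{x}^{(k)} - z^{(k)} \right\|^2 + \frac{1}{\alpha} \left\langle \bar{x}^{(k)} - z^{(k)},\; z^{(k)} - \bar{q}^{(k)} \right\rangle.$$

By definition, $z^{(k)} = \mathrm{prox}_h^{\alpha}\{\bar{q}^{(k)}\}$ also implies
$\frac{1}{\alpha} (\bar{q}^{(k)} - z^{(k)}) \in \partial h(z^{(k)})$, and therefore its norm is
bounded by $G_h$. As a result,

$$\varepsilon^{(k)} \le 2 G_h \left\| \bar{x}^{(k)} - z^{(k)} \right\| + \frac{1}{2\alpha} \left\| \bar{x}^{(k)} - z^{(k)} \right\|^2.$$

Combined with the nonexpansiveness of the proximal
operator (Proposition 1(d)),

$$\left\| \bar{x}^{(k)} - z^{(k)} \right\| \le \frac{1}{m} \sum_{i=1}^{m} \left\| \mathrm{prox}_h^{\alpha}\{\hat{q}_i^{(k)}\} - \mathrm{prox}_h^{\alpha}\{\bar{q}^{(k)}\} \right\| \le \frac{1}{m} \sum_{i=1}^{m} \left\| \hat{q}_i^{(k)} - \bar{q}^{(k)} \right\|,$$

we arrive at the desired expression.

■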

This proposition shows that the two error sequences
$\|e^{(k)}\|$ and $\varepsilon^{(k)}$ have upper bounds in terms of
$\sum_{i=1}^{m} \|y_i^{(k-1)} - \bar{y}^{(k-1)}\|$ and $\sum_{i=1}^{m} \|\hat{q}_i^{(k)} - \bar{q}^{(k)}\|$,
respectively, which are in turn controlled by the multi-step
consensus stage. According to [4, Proposition 2], if
$\{k \|e^{(k)}\|\}$ and $\{k \sqrt{\varepsilon^{(k)}}\}$ are both summable, then the
inexact proximal-gradient method exhibits the optimal exact
convergence rate of $O(1/n^2)$. In the following sections, we
shall see that this is indeed the case.

B. Convergence Rate Analysis 

We next show that the sequences $\{\|e^{(k)}\|\}_{k=1}^{\infty}$ and
$\{\varepsilon^{(k)}\}_{k=1}^{\infty}$ are bounded above by polynomial-geometric
sequences. By Proposition 3, this establishes that the
sequences $\{k \|e^{(k)}\|\}$ and $\{k \sqrt{\varepsilon^{(k)}}\}$ are summable. We first
present some useful recursive expressions of the iterates.
Proposition 6: (Recursive expressions of iterates)
Let sequences $\{x_i^{(k)}\}_{k=1}^{\infty}$, $\{y_i^{(k)}\}_{k=1}^{\infty}$, $\{q_i^{(k)}\}_{k=1}^{\infty}$,
$i = 1, \ldots, m$, be iterates generated by Algorithm
(6). For every $k \ge 2$, we have



(a) Er=i it 



Em 

(b) Er=i 



X 



< Er=i 

(fc-i) 



+ aTO(Gg + Gh) + 



{k - l)am(Gg + Gh) 



(c) 



Vi -y 



(fc) 



< 



< STOrELiVEr^ilk; 



(fc-i) 



The proof is given in the online appendix. These recursive
expressions allow us to bound $\sum_{i=1}^{m} \|q_i^{(k)}\|$ with a second-order
polynomial of $k$, as in the following lemma, whose
proof is also given in the appendix.

Lemma 1: (Polynomial bound on $\sum_{i=1}^{m} \|q_i^{(k)}\|$)
Let sequences $\{x_i^{(k)}\}_{k=1}^{\infty}$, $\{y_i^{(k)}\}_{k=1}^{\infty}$, $\{q_i^{(k)}\}_{k=1}^{\infty}$,
$\{\hat{q}_i^{(k)}\}_{k=1}^{\infty}$, $i = 1, \ldots, m$, be generated by Algorithm (6).
Then there exist scalars $C_q, C_q', C_q''$ such that for $k \ge 2$,

$$\sum_{i=1}^{m} \left\| q_i^{(k)} \right\| \le C_q + C_q' k + C_q'' k^2.$$



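We now apply Lemma 1 on the error sequences in (10) to
show that $\{k \|e^{(k)}\|\}_{k=1}^{\infty}$ and $\{k \sqrt{\varepsilon^{(k)}}\}_{k=1}^{\infty}$ are
polynomial-geometric sequences, thus summable:

Lemma 2: (Summability of $\{k \|e^{(k)}\|\}_{k=1}^{\infty}$ and $\{k \sqrt{\varepsilon^{(k)}}\}_{k=1}^{\infty}$)
In the formulation (10), where

$$\|e^{(k)}\| \le \frac{L}{m} \sum_{i=1}^{m} \left\| y_i^{(k-1)} - \bar{y}^{(k-1)} \right\|,$$

$$\varepsilon^{(k)} \le \frac{2 G_h}{m} \sum_{i=1}^{m} \left\| \hat{q}_i^{(k)} - \bar{q}^{(k)} \right\| + \frac{1}{2\alpha} \left( \frac{1}{m} \sum_{i=1}^{m} \left\| \hat{q}_i^{(k)} - \bar{q}^{(k)} \right\| \right)^2,$$

we have

(a) $\sum_{k=1}^{\infty} k \|e^{(k)}\| < \infty$;

(b) $\sum_{k=1}^{\infty} k \sqrt{\varepsilon^{(k)}} < \infty$.

Proof: In both cases, it suffices to show that the
sequence is a polynomial-geometric sequence. The result
then follows by Proposition 3.

(a) Proposition 6(c) bounds $\sum_{i=1}^{m} \|y_i^{(k)} - \bar{y}^{(k)}\|$ in terms of
the sums $\sum_{i=1}^{m} \|q_i^{(k)}\|$, and by Lemma 1,
$\sum_{i=1}^{m} \|q_i^{(k)}\| \le C_q + C_q' k + C_q'' k^2$. Therefore,

$$k \|e^{(k)}\| \le 4 L \Gamma \gamma^k k \left( C_q + C_q' k + C_q'' k^2 \right) + 2 L \Gamma \gamma^{k-1} k \left( C_q + C_q' (k-1) + C_q'' (k-1)^2 \right),$$

which is a polynomial-geometric sequence.

(b) Recall from (9) that $\frac{1}{m} \sum_{i=1}^{m} \|\hat{q}_i^{(k)} - \bar{q}^{(k)}\| \le \Gamma \gamma^k \sum_{j=1}^{m} \|q_j^{(k)}\|$,
and by Lemma 1, $\sum_{i=1}^{m} \|q_i^{(k)}\| \le C_q + C_q' k + C_q'' k^2$. Therefore,

$$\frac{1}{m} \sum_{i=1}^{m} \left\| \hat{q}_i^{(k)} - \bar{q}^{(k)} \right\| \le \Gamma \gamma^k \left( C_q + C_q' k + C_q'' k^2 \right)$$

and

$$\varepsilon^{(k)} \le 2 G_h \Gamma \gamma^k \left( C_q + C_q' k + C_q'' k^2 \right) + \frac{1}{2\alpha} \left[ \Gamma \gamma^k \left( C_q + C_q' k + C_q'' k^2 \right) \right]^2.$$

Using the fact that $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for all nonnegative
real numbers $a, b$, we have

$$\begin{aligned}
\sqrt{\varepsilon^{(k)}} &\le \sqrt{2 G_h \Gamma \gamma^k \left( C_q + C_q' k + C_q'' k^2 \right)} + \frac{1}{\sqrt{2\alpha}} \Gamma \gamma^k \left( C_q + C_q' k + C_q'' k^2 \right) \\
&\le \sqrt{2 G_h \Gamma} \, \gamma^{k/2} \left( \sqrt{C_q} + \sqrt{C_q'} \, k + \sqrt{C_q''} \, k \right) + \frac{\Gamma}{\sqrt{2\alpha}} \gamma^k \left( C_q + C_q' k + C_q'' k^2 \right),
\end{aligned}$$

where in the last line we used the fact that $\sqrt{k} \le k$
for all $k \ge 1$. This is a polynomial-geometric sequence.
Therefore, $k \sqrt{\varepsilon^{(k)}}$ is also a polynomial-geometric sequence.

■

Using the lemma above, we can establish the convergence
rate of our distributed proximal-gradient method:

Theorem 1: (Convergence rate of the distributed
proximal-gradient method with multi-step consensus) Let
$\{x_i^{(k)}\}_{k=1}^{\infty}$ be iterates generated by Algorithm (6), with
a constant step size $\alpha \le 1/L$ where $L$ is the Lipschitz
constant in Assumption 1. Let $\bar{x}^{(k)} = \frac{1}{m} \sum_{i=1}^{m} x_i^{(k)}$ be the
average iterate at iteration $k$. Then, for all $t \ge 1$, where $t$
is the total number of communication steps taken, we have

$$f(\bar{x}^{(t)}) - f(x^*) = O(1/t).$$

Proof: Since it takes $k$ communication steps to complete
iteration $k$, the total number of communication steps
required to execute iterations $1, \ldots, n$ is $\sum_{k=1}^{n} k = \frac{n(n+1)}{2}$.
In other words, after $t$ communication steps, the number
of iterations completed is $n$, where $n$ is the greatest integer
such that $\frac{n(n+1)}{2} \le t$, or equivalently,

$$n = \left\lfloor \frac{-1 + \sqrt{1 + 8t}}{2} \right\rfloor.$$

As a result,

$$(n+1)^2 > \frac{2 + 8t - 2\sqrt{1 + 8t}}{4},$$

and thus, writing $D$ for the constant in the $O(1/n^2)$ bound
obtained from Proposition 2 and Lemma 2,

$$\frac{D}{(n+1)^2} < \frac{2D}{4t + 1 - \sqrt{1 + 8t}} = O(1/t).$$

■

Although this theorem is stated in terms of $\bar{x}^{(k)}$, it could
also lead to a bound on $f(x_i^{(k)}) - f(x^*)$, using the gradient
bound, nonexpansiveness of the proximal operator, and (9).
Also, note that the results above hold for the fast distributed
gradient method (where the objective functions are differentiable),
which is clear by simply setting $h(x) = 0$, $G_h = 0$
and $\varepsilon^{(k)} = 0$.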
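
The counting argument in the proof of Theorem 1, which converts a budget of $t$ communication steps into a number of completed iterations $n$, can be checked with a few lines of integer arithmetic. The helper name below is hypothetical.

```python
import math

def iterations_completed(t):
    """Largest n with n*(n+1)/2 <= t, i.e. floor((-1 + sqrt(1 + 8t)) / 2)."""
    return (math.isqrt(1 + 8 * t) - 1) // 2

checks = []
for t in range(1, 2000):
    n = iterations_completed(t)
    # n full iterations fit in the budget, but n + 1 iterations do not.
    fits = n * (n + 1) // 2 <= t < (n + 1) * (n + 2) // 2
    # Lower bound on (n + 1)^2 used to turn O(1/n^2) into O(1/t).
    bound = (n + 1) ** 2 > (2 + 8 * t - 2 * math.sqrt(1 + 8 * t)) / 4
    checks.append(fits and bound)
```

Since $(n+1)^2$ grows linearly in $t$, an $O(1/n^2)$ rate in iterations becomes an $O(1/t)$ rate in communication steps.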



C. Beyond O(1/t)

We have thus shown that taking $k$ communication steps
in the $k$-th iteration of Algorithm (6) results in the summability
of error sequences $\{k \|e^{(k)}\|\}_{k=1}^{\infty}$ and $\{k \sqrt{\varepsilon^{(k)}}\}_{k=1}^{\infty}$.
A natural question arises: can we do better? In particular,
will the error sequences still converge if we took fewer than $k$
communication steps in the $k$-th iteration? We address this
question in this section.

Let $s_k$ be the number of communication steps taken in
the multi-step consensus stage at iteration $k$. In our method
presented earlier, $s_k = k$. We wish to explore smaller
choices of $s_k$ that still preserve the guarantee for exact
convergence.

With Sfc, we see that ([SJ can be written as 



,(k) 
1i 



4k) 



\1' 



(k) 



Therefore, Proposition |6|b) becomes 



i=l 
k-1 



1=1 



El 



+ {k-l)am{Gg + Gh) 



As a result, if we have the equivalent of Proposition 3 for
$s_k$, i.e., if $\sum_{k=0}^{\infty} k^N \gamma^{s_k} < \infty$ for any given $\gamma \in (0, 1)$ and
nonnegative integer $N$, then Lemma 1 would hold, and so
would Theorem 1.



Since J2 



fe=0 



< 00 for a < 



1, a sufficient con- 
dition for the above is 7'^'' < k^^^^, or equivalently, 
Sk > log k- This is at the order of O(logfc), which 

is smaller than our previous choice of s/^ ~ k = 0'^^\ The 
hidden constant, ~]^~^ , depends on N and 7. In our case, 
we only require this condition to hold up to = 3. There- 
fore, if 7 is known, by choosing Sk = log(fc + 1) 
the distributed proximal-gradient method is guaranteed to 
converge with rate $O(1/n^2)$, where $n$ is the iteration number.
The time it takes to complete iterations $1, \ldots, n$, which we
denote by $T(n)$, is then

$$T(n) = \sum_{k=1}^{n} s_k = O(n \log n),$$

since $\int \log x \, dx = x(\log x - 1)$. Unfortunately, $n \log n = \log n \cdot e^{\log n}$
has no explicit inverse expression [22]. Therefore,
we can only express the convergence rate as

$$f(\bar{x}^{(t)}) - f(x^*) = O\left( 1 / (T^{-1}(t))^2 \right),$$

which we know to be better than $O(1/t)$, since $T(n)$ is
bounded above by $O(n^2)$.

In closing, we remark that the improved choice of $s_k$
above requires the knowledge of $\gamma$, which may not be readily
available if detailed information or performance guarantees
of the communication network are unknown. In such cases,
the method could still be implemented with $s_k = k$.
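
To get a feel for the difference between the two consensus schedules, the sketch below counts the total number of communication steps needed for $n$ iterations under $s_k = k$ and under a logarithmic schedule $s_k = \lceil c \log(k+1) \rceil$. The constant $c$ is a hypothetical placeholder; in the analysis it depends on $\gamma$, which is not assumed known here.

```python
import math

def total_steps_linear(n):
    """Communication steps for iterations 1..n when s_k = k."""
    return n * (n + 1) // 2

def total_steps_log(n, c):
    """Communication steps for iterations 1..n when s_k = ceil(c * log(k + 1)).

    The constant c is an assumed placeholder for the gamma-dependent constant.
    """
    return sum(math.ceil(c * math.log(k + 1)) for k in range(1, n + 1))

n = 1000
t_linear = total_steps_linear(n)        # grows like n^2 / 2
t_log = total_steps_log(n, c=5.0)       # grows like n * log(n)
```

For the same number of iterations the logarithmic schedule uses far fewer communication steps, which is the source of the improved rate in terms of $t$.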

IV. Numerical Experiments 

Our theoretical findings are verified with numerical
experiments on a machine learning task using 20 Newsgroups
[23], [24], a benchmark dataset for text categorization.
It consists of about 20,000 news articles, evenly chosen
among 20 topics. The task is to perform $L_1$-regularized
logistic regression on the training data, so as to learn the
classification model for a chosen topic label. Specifically,
we wish to minimize

$$f(x) = \frac{1}{N} \sum_{j=1}^{N} \log\left( 1 + \exp(-b_j \langle a_j, x \rangle) \right) + \lambda \|x\|_1,$$

where $N$ is the total number of news articles, $a_j$ is the
8615-dimensional feature vector of article $j$, and $b_j$ is
the label for the chosen topic, which is equal to 1 if this
article belongs to the topic, and $-1$ otherwise. $x$ contains
parameters of the classification model that we wish to learn,
and $f(x)$ is its corresponding regularized loss function.



We distribute the training data across a network of $m = 10$
data centers, each with 1129 samples. Thus, each data
center has the following private objective function:

$$f_i(x_i) = \frac{1}{|N_i|} \sum_{j \in N_i} \log\left( 1 + \exp(-b_j \langle a_j, x_i \rangle) \right) + \lambda \|x_i\|_1,$$

where $N_i$ is the subset of data at center $i$, and $x_i$, $i = 1, \ldots, m$,
is its local estimate of the global classification
model. In each communication step, a weight matrix is randomly
chosen from a pool of 10 weight matrices generated
from connected random graphs. All weight matrices satisfy
Assumption 2.
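
The private objective of a single data center can be written down directly. The following sketch evaluates the $L_1$-regularized logistic loss and the gradient of its smooth part for a tiny made-up sample set; the data, the function names, and the value of the regularization weight are illustrative assumptions and are unrelated to the 20 Newsgroups features.

```python
import math

def local_objective(x, samples, lam):
    """f_i(x) = mean_j log(1 + exp(-b_j * <a_j, x>)) + lam * ||x||_1."""
    loss = 0.0
    for a, b in samples:
        margin = b * sum(al * xl for al, xl in zip(a, x))
        loss += math.log1p(math.exp(-margin))   # fine for moderate margins
    return loss / len(samples) + lam * sum(abs(v) for v in x)

def local_gradient(x, samples):
    """Gradient of the smooth part g_i only (the logistic loss without the L1 term)."""
    grad = [0.0] * len(x)
    for a, b in samples:
        margin = b * sum(al * xl for al, xl in zip(a, x))
        weight = -b / (1.0 + math.exp(margin))  # derivative of log(1 + exp(-margin))
        for l, al in enumerate(a):
            grad[l] += weight * al
    return [g / len(samples) for g in grad]

# Made-up toy data: (feature vector a_j, label b_j in {+1, -1}).
samples = [([1.0, 0.0], 1), ([0.0, 2.0], -1)]
f0 = local_objective([0.0, 0.0], samples, lam=0.1)   # equals log(2) at x = 0
g0 = local_gradient([0.0, 0.0], samples)             # equals [-0.25, 0.5] at x = 0
```

In the distributed method only the gradient of the smooth part enters the local gradient step, while the $L_1$ term is handled by the proximal step.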

To demonstrate the effect of using multiple communica- 
tion steps after the gradient step in our method, we compare 
it with the following methods: 

• The basic subgradient method with single-step consen- 
sus in yj: 



(fc) 
„{fe) 



■w 

E 



(fc-i) 



a (\7gi{w. 

(k) (k) 

('^-'^^ e 5/i(u;f~'') and 



(13) 
is the 



where zii{w 

randomly-chosen weight matrix. 
The basic proximal-gradient method with single-step 
consensus, similar to that of 111: 



(fe) 



= prox^{u>. 



(fc-i) 



(fc) (fc) 



(14) 



where 



is the randomly-chosen weight matrix. 

The accelerated proximal-gradient method with single- 
step consensus: 

(fc) ar (fc-l) v7 / (fc-l)M 



y 



(fe) 

i 

„(*) 



I k-l,(k) 



_ M 

•^i ' fc + 2 



(15) 



The accelerated proximal-gradient method with multi- 
step consensus which is not inserted between the gra- 
dient and proximal steps, but instead performed only 
after the proximal step: 



(fc) ar (fc-l) VT ^ (fc-l)M 



(fe) 

(k) 



(fe) I 



fe-1 (^(k) 
fe + 2 



(k-1) 



Em \ X - / 
,=1 y 



(fe) (fe) 
'j yj 



(16) 



where 



(fc) 



is the product of k weight matrices 
randomly drawn from the pool of 10 weight matrices. 
Figure 1 shows convergence rate results for each method.
It is clear from the figure that Algorithm (13) converges
to an error neighborhood at rate $O(1/t)$, as shown in [1].
Algorithms (14) and (15) also converge to an error neighborhood,
but the latter exhibits more oscillation than the basic
methods. Algorithm (16) converges with rate $O(1/t)$, but
only to an error neighborhood instead of achieving exact
convergence. This highlights the importance of having the
consensus step before the proximal step instead of after
it. Finally, our accelerated multi-step method attains exact
convergence with rate $O(1/t)$, outperforming all others.




V. Conclusion and Future Work 

We presented a distributed proximal-gradient method that 
solves for the optimum of the average of convex functions, 
each having a distinct differentiable component and a com- 
mon nondifferentiable component. The method uses multiple 
communication steps and Nesterov's acceleration technique. 
We established the convergence rate of this method as
$O(1/t)$ (where $t$ is the total number of communication
steps), superior to most existing distributed methods.

Several questions remain open for future work. First, it 
would be useful to generalize the result for the case where 
the nondifferentiable functions hi{x) are distinct. Secondly, 
it is of interest to determine the condition under which
the accelerated single-step proximal-gradient method (15)
converges, and compare its performance with our multi-step
consensus method. Last but not least, it would be useful to 
obtain a lower bound on the convergence rate of distributed 
first-order methods under our current framework. 
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Appendix 

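Proof of Proposition 6

Throughout this proof, let $\beta_k = \frac{k-1}{k+2}$ for simplicity.
Moreover, it is useful to note that, by Proposition 1, (6c)
could be written as

$$x_i^{(k)} = \hat{q}_i^{(k)} - \alpha z_i^{(k)}, \quad \text{where } z_i^{(k)} \in \partial h\big(x_i^{(k)}\big). \tag{17}$$

Since $h$ has bounded subgradients, this also implies

$$\big\|x_i^{(k)} - \hat{q}_i^{(k)}\big\| \le \alpha G_h. \tag{18}$$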
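As a concrete illustration of the subgradient form of the proximal step and the resulting distance bound, the sketch below checks both numerically for the particular choice $h(x) = \|x\|_1$. That choice, the bound $G_h = \sqrt{n}$ on the Euclidean norm of its subgradients, and the function names are assumptions made for illustration only.

```python
import numpy as np

def prox_l1(q, alpha):
    """Proximal operator of alpha * ||.||_1 evaluated at q."""
    return np.sign(q) * np.maximum(np.abs(q) - alpha, 0.0)

def check_prox_characterization(q, alpha):
    x = prox_l1(q, alpha)
    z = (q - x) / alpha                    # chosen so that x = q - alpha * z
    nonzero = x != 0
    # Subdifferential of the l1 norm: sign(x_j) where x_j != 0, [-1, 1] where x_j == 0.
    is_subgradient = bool(
        np.allclose(z[nonzero], np.sign(x[nonzero]))
        and np.all(np.abs(z[~nonzero]) <= 1.0 + 1e-12)
    )
    G_h = np.sqrt(q.size)                  # bound on ||z||_2 for the l1 norm
    within_bound = bool(np.linalg.norm(x - q) <= alpha * G_h + 1e-12)
    return is_subgradient, within_bound
```

Both returned flags should be true for any input vector and any positive step size.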
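(a) Taking the norm of (6a) and summing over $i$, we have

$$\sum_{i=1}^m \big\|q_i^{(k)}\big\| \le \sum_{i=1}^m \big\|y_i^{(k-1)}\big\| + \alpha m G_g, \tag{19}$$

where we used the gradient bound in Assumption 1.
According to (6d), we have

$$\big\|y_i^{(k-1)}\big\| \le \big\|x_i^{(k-1)}\big\| + \beta_{k-1}\big\|x_i^{(k-1)} - x_i^{(k-2)}\big\|,$$

and by (18), we have

$$\big\|x_i^{(k-1)}\big\| \le \big\|\hat{q}_i^{(k-1)}\big\| + \alpha G_h.$$

Therefore,

$$\big\|y_i^{(k-1)}\big\| \le \big\|\hat{q}_i^{(k-1)}\big\| + \alpha G_h + \beta_{k-1}\big\|x_i^{(k-1)} - x_i^{(k-2)}\big\|. \tag{20}$$

Next, we use (6b), which states that $\hat{q}_i^{(k-1)}$
is a convex combination of $\{q_j^{(k-1)}\}_{j=1}^m$, so

$$\sum_{i=1}^m \big\|\hat{q}_i^{(k-1)}\big\| \le \sum_{i=1}^m \big\|q_i^{(k-1)}\big\|. \tag{21}$$

Substituting (20)-(21) back in (19), we have

$$\sum_{i=1}^m \big\|q_i^{(k)}\big\| \le \sum_{i=1}^m \big\|q_i^{(k-1)}\big\| + \beta_{k-1}\sum_{i=1}^m \big\|x_i^{(k-1)} - x_i^{(k-2)}\big\| + \alpha m (G_g + G_h).$$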



Subtracting $x_i^{(k-1)}$ from the previous expression and
taking the sum of the norm, we have

$\sum_{i=1}^m \big\|x_i^{(k)} - x_i^{(k-1)}\big\|$
m m 

EEa 

i=l J=l 



(fc-1) (fc-1) (fc-1) 



■Pk-.Y. 



Sk-l) (k-2) 



+ am{Gg + Gh), 
(22) 



where we used the convexity of the norm operator along 



with the fact that J^Zi ^ij^^^ = 
Now consider J2T=i E7=i 



1. 



k-i (fc-i) 



the expression above. By the nonexpansiveness of
the proximal operator, we have

$$\big\|x_j^{(k-1)} - x_i^{(k-1)}\big\| \le \big\|\hat{q}_j^{(k-1)} - \hat{q}_i^{(k-1)}\big\|.$$
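Nonexpansiveness of the proximal operator can be sanity-checked numerically. The sketch below does so for the soft-thresholding operator, which is the proximal operator of a scaled $\ell_1$ norm; that choice of $h$ and the helper names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def prox_l1(v, alpha):
    """Proximal operator of alpha * ||.||_1 evaluated at v."""
    return np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)

def max_expansion_ratio(num_trials=1000, dim=5, alpha=0.5, seed=0):
    """Largest observed ||prox(u) - prox(v)|| / ||u - v|| over random pairs."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(num_trials):
        u = rng.normal(size=dim)
        v = rng.normal(size=dim)
        ratio = np.linalg.norm(prox_l1(u, alpha) - prox_l1(v, alpha)) / np.linalg.norm(u - v)
        worst = max(worst, ratio)
    return worst
```

A ratio that never exceeds 1 is what the inequality predicts; a random search like this can only fail to find a counterexample, it does not prove the property.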



Using the fact that the weight matrix
is doubly stochastic, we have



(fe-i) 



Uk-l) 



EEa 

i=l j=l 

in m 



(fc-l) 4k-l) 



^ik-l)_-ik-l) 



The right-hand side can in turn be bounded with (11). As
a result,



EEa^ 

i=l 3 = 1 



(fe-i) (fe-i) (fc-i) 



a;- 



(fc-i) 



(23) 



Substituting this back to (22),

ra 

El 



(fe) (fc-i) 



Finally, we omit $\beta_{k-1} < 1$, and increment the indices
by 1 so that the expression is applicable to $k \ge 2$.
(b) Starting with (17) and applying (6b), (6a), (6d) in order,
we have



(fc-i) (fc) (fc) 
\^ 'q) ' ~ az\ 



^Ea 



^;(fc-l) 



(fc-i) 



■ E 



ffc-l) ^(fc-2) 



+ am{Gg + Gh) 



fc-i 



7fi. 



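$$\le \sum_{l=1}^{k-1} 2m\Gamma\gamma^{l} \sum_{j=1}^m \big\|q_j^{(l)}\big\| + (k-1)\,\alpha m (G_g + G_h),$$

where the final line is due to recursion, and omitting
$\beta_l < 1$ for $l > 1$ while using $\beta_1 = 0$ to eliminate
the trailing term $\sum_{i=1}^m \|x_i^{(1)} - x_i^{(0)}\|$. This gives the desired
expression.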
(c) By (6d),



Note 
i=l 2^1 = 1



also that 

(k) 



(fc) 



by a similar reasoning as that of (23). Therefore,



<(l + /3fc) 



■A 
j=1 i=1
Omitting $\beta_k < 1$ gives statement (c).



Proof of Lemma 7

We proceed by induction on $k$. First, we show that the
result holds for $k = 2$ by choosing $C_q = \sum_{i=1}^m \|q_i^{(2)}\|$. It
suffices to show that, given the initial points $y_i^{(0)}$, $\sum_{j=1}^m \|q_j^{(2)}\|$
is bounded.

Indeed, by (19),



Comparing coefficients, we see that the right-hand side
can be bounded by $C_q + C_q'(k+1) + C_q''(k+1)^2$ if
$\alpha m (G_g + G_h) \le 2C_q''$ for the coefficient of $k$, and
$2m\Gamma\,(C_q S_0^\gamma + C_q' S_1^\gamma + C_q'' S_2^\gamma) \le C_q' + C_q''$ for the constant
coefficient. Therefore, the induction hypothesis holds for
$k + 1$ if we take



2mTCqS2 + {2mTSl - l)C'^ 
2mrSj - 1 ' 
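The coefficient comparison can be spelled out as a short worked expansion. It assumes the right-hand side of the induction bound has the form shown in the second line below, with the two $\alpha m (G_g + G_h)$ terms (multipliers $1$ and $k-1$) combined; $D$ is shorthand introduced here for the constant term containing $2m\Gamma$.

```latex
\begin{align*}
C_q + C_q'(k+1) + C_q''(k+1)^2
  &= C_q + C_q' k + C_q'' k^2 + 2C_q''\,k + \bigl(C_q' + C_q''\bigr), \\
\text{right-hand side}
  &= C_q + C_q' k + C_q'' k^2 + \alpha m (G_g + G_h)\,k + D.
\end{align*}
```

Matching the terms linear in $k$ gives $\alpha m (G_g + G_h) \le 2C_q''$, and matching the constant terms gives $D \le C_q' + C_q''$.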
$+\ \alpha m (G_g + G_h) < \infty$



where the first line is due to the fact that $\beta_1 = 0$, so
$y_i^{(1)} = x_i^{(1)}$

, and the second line is because of ([TTji and 



Therefore, $C_q = \sum_{i=1}^m \|q_i^{(2)}\| < \infty$ is a valid choice.

Now suppose the result holds for some positive integer
$k \ge 2$. We show that it also holds for $k + 1$.

Substituting the induction hypothesis for $k$ into
Proposition 6(b), we have

$$\sum_{i=1}^m \big\|x_i^{(k)} - x_i^{(k-1)}\big\| \le 2m\Gamma \sum_{l=1}^{k-1} \gamma^{l}\,\bigl(C_q + C_q' l + C_q'' l^2\bigr) + (k-1)\,\alpha m (G_g + G_h).$$
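By Proposition 3 and expression there exist constants
$S_0^\gamma, S_1^\gamma, S_2^\gamma$ such that

$$\sum_{l=0}^{k-1} \gamma^{l}\,\bigl(C_q + C_q' l + C_q'' l^2\bigr) \le C_q S_0^\gamma + C_q' S_1^\gamma + C_q'' S_2^\gamma.$$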
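To see why such constants exist, suppose (an assumption about Proposition 3, which is not reproduced in this section) that the constants stand for the series $\sum_{l \ge 0} l^N \gamma^l$ with $0 < \gamma < 1$ and $N = 0, 1, 2$. These series have standard closed forms:

```latex
\begin{align*}
\sum_{l=0}^{\infty} \gamma^{l} &= \frac{1}{1-\gamma}, \\
\sum_{l=0}^{\infty} l\,\gamma^{l} &= \frac{\gamma}{(1-\gamma)^{2}}, \\
\sum_{l=0}^{\infty} l^{2}\gamma^{l} &= \frac{\gamma(1+\gamma)}{(1-\gamma)^{3}}.
\end{align*}
```

Each is finite for $0 < \gamma < 1$, and since all terms are nonnegative, every partial sum is bounded by the same value regardless of $k$.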



Proposition 6(a) and the induction hypothesis then give

$$\sum_{i=1}^m \big\|q_i^{(k+1)}\big\| \le C_q + C_q' k + C_q'' k^2 + \alpha m (G_g + G_h) + 2m\Gamma\,\bigl(C_q S_0^\gamma + C_q' S_1^\gamma + C_q'' S_2^\gamma\bigr) + (k-1)\,\alpha m (G_g + G_h).$$



